CLASSIFIERS BASED ON BAYES DECISION THEORY¶
Introduction¶
In this unit, we discuss classifiers based on Bayesian decision theory, which rest on the idea of designing classifiers that assign an unknown pattern to the most probable class. The question, then, is how to define what most probable means.
Given a classification task of $M$ classes, $\omega_1, \omega_2, \ldots, \omega_M$, and an unknown pattern, which is represented by a feature vector $x$, we form the $M$ conditional probabilities $$P(\omega_i|X=x), \ i=1,2,\ldots,M$$ that is, the probability that the unknown pattern belongs to the respective class $\omega_i$, given that we know its feature vector takes the value $x$. These are also referred to as a posteriori probabilities.
The classifiers based on Bayes decision theory quantify the most probable class by computing either the maximum of these $M$ values or, equivalently, the maximum of an appropriately defined function of them. The unknown pattern is then assigned to the class corresponding to this maximum.
Bayes Decision Theory¶
We will initially focus on the two-class case. Let $\omega_1$, $\omega_2$ be the two classes to which our patterns belong.
We assume that the a priori probabilities $P(\omega_1)$, $P(\omega_2)$ are known. This is a reasonable assumption, because even if they are not known exactly, they can easily be estimated from the available training feature vectors: if $N$ is the total number of available training patterns, and $N_1$, $N_2$ of them belong to $\omega_1$ and $\omega_2$, respectively, then $$P(\omega_1) \approx \frac{N_1}{N}, \qquad P(\omega_2) \approx \frac{N_2}{N}$$
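The frequency estimates above can be sketched in a few lines; the label vector below is hypothetical, with `0` standing for $\omega_1$ and `1` for $\omega_2$:

```python
import numpy as np

# Hypothetical class labels for N = 10 training patterns (0 -> omega_1, 1 -> omega_2)
y = np.array([0, 0, 1, 0, 1, 1, 0, 1, 0, 0])

N = len(y)
N1 = np.sum(y == 0)   # number of patterns in omega_1
N2 = np.sum(y == 1)   # number of patterns in omega_2

P1 = N1 / N           # a priori probability P(omega_1) ~ N1 / N
P2 = N2 / N           # a priori probability P(omega_2) ~ N2 / N
print(P1, P2)         # -> 0.6 0.4
```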
We also assume that the class-conditional probability density functions $$p(x | \omega_i), \ i=1,2$$ are known. These describe the distribution of the feature vectors in each of the classes. If they are not known, they can also be estimated from the available training data.
The pdf (probability density function) $p(x | \omega_i)$ is sometimes referred to as the likelihood function of $\omega_i$ with respect to $x$. If the feature vectors can take only discrete values, the density functions $p(x | \omega_i)$ become probabilities (probability mass functions) and will be denoted by $P(x| \omega_i)$.
Using Bayes' rule, we can calculate the conditional probabilities: $$P(\omega_i | X=x) = \frac{p(x| \omega_i)P(\omega_i)}{p(x)}$$ where $p(x)$ is the pdf of $x$ and for which we have $$p(x) = \sum_{i=1}^2 p(x|\omega_i) P(\omega_i)$$
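The posterior computation above can be sketched numerically; the priors and the 1-D Gaussian class-conditional densities below are hypothetical choices for illustration:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical priors and 1-D Gaussian class-conditional densities
P = [0.6, 0.4]                          # P(omega_1), P(omega_2)
lik = [norm(0.0, 1.0), norm(2.0, 1.0)]  # p(x|omega_1), p(x|omega_2)

def posteriors(x):
    # Numerators of Bayes' rule: p(x|omega_i) * P(omega_i)
    joint = np.array([l.pdf(x) * p for l, p in zip(lik, P)])
    px = joint.sum()                    # p(x) = sum_i p(x|omega_i) P(omega_i)
    return joint / px                   # P(omega_i | x)

post = posteriors(1.0)
print(post, post.sum())                 # the posteriors always sum to 1
```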
The Bayes classification rule can now be stated as:
$$\text{If} \qquad P(\omega_1 | x) > P(\omega_2 | x), \ x \rightarrow \omega_1$$$$\text{If} \qquad P(\omega_1 | x) < P(\omega_2 | x), \ x \rightarrow \omega_2$$In the case of equality, $P(\omega_1|x) = P(\omega_2|x)$, the pattern is arbitrarily assigned to one of the two classes.
So the only task is to estimate:
- $P(\omega_i)$ (the a priori probabilities)
- $p(x|\omega_i)$ (the likelihood function of $\omega_i$ with respect to $x$)
Using Bayes' rule, the decision can equivalently be based on the inequalities $$p(x|\omega_1) P(\omega_1) \gtrless p(x|\omega_2) P(\omega_2)$$ $p(x)$ is not taken into account, because it is the same for all classes and it does not affect the decision.
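This equivalent form of the rule can be sketched directly; as before, the priors and the 1-D Gaussian densities below are hypothetical:

```python
import numpy as np
from scipy.stats import norm

# A minimal sketch of the two-class Bayes rule; the Gaussian
# class-conditional densities below are hypothetical.
P1, P2 = 0.5, 0.5                  # a priori probabilities
lik1 = norm(loc=0.0, scale=1.0)    # p(x | omega_1)
lik2 = norm(loc=3.0, scale=1.0)    # p(x | omega_2)

def classify(x):
    # Compare p(x|omega_1)P(omega_1) with p(x|omega_2)P(omega_2);
    # p(x) is omitted since it is common to both classes.
    return 1 if lik1.pdf(x) * P1 > lik2.pdf(x) * P2 else 2

print(classify(0.5))   # close to the mean of omega_1 -> 1
print(classify(2.9))   # close to the mean of omega_2 -> 2
```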
Example¶
Iris dataset¶
It is a dataset of measurements of 3 species of iris flowers, containing 4 features for each flower:
- Length and width of the sepal
- Length and width of the petal
The data was collected by botanist Edgar Anderson and used by statistician Ronald Fisher as an example of his linear discriminant analysis.
Let $$X : \Omega \rightarrow \mathbb{R}^2$$ be a random vector that represents 2 features of each pattern.
We assume that each pattern in $\Omega$ belongs to a specific class:
- Setosa ($\omega_1$).
- Versicolor ($\omega_2$).
- Virginica ($\omega_3$).
Estimated density function of $X$ $$p(\textbf{x})$$¶
Probability of each class (a priori probabilities) $$P(\omega_i)$$¶
Probability of class 0: 0.3333333333333333
Probability of class 1: 0.3333333333333333
Probability of class 2: 0.3333333333333333
Class-conditional probability density functions $$p(x|\omega_i)$$¶
Bayesian classification rule¶
Class 0: [0.04133813]
Class 1: [0.01247693]
Class 2: [0.00812357]
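The whole pipeline on the Iris data can be sketched as follows. This is one possible reconstruction, assuming the scikit-learn copy of the dataset, the first two features (sepal length and width), and Gaussian class-conditional densities fitted by maximum likelihood; the original notebook may have used a different density estimate, so the values printed above are not necessarily reproduced.

```python
import numpy as np
from sklearn.datasets import load_iris
from scipy.stats import multivariate_normal

# Use the first two features: sepal length and sepal width
iris = load_iris()
X, y = iris.data[:, :2], iris.target

priors, densities = [], []
for c in np.unique(y):
    Xc = X[y == c]
    priors.append(len(Xc) / len(X))   # P(omega_i) = N_i / N
    # Gaussian class-conditional density p(x|omega_i), fitted by ML
    densities.append(multivariate_normal(Xc.mean(axis=0), np.cov(Xc.T)))

def classify(x):
    # argmax_i p(x|omega_i) P(omega_i); p(x) is common and can be dropped
    scores = [d.pdf(x) * p for d, p in zip(densities, priors)]
    return int(np.argmax(scores))

print(priors)                 # 50 of 150 patterns per class -> 1/3 each
print(classify([5.0, 3.5]))   # a typical setosa measurement -> class 0
```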
Appendix¶
Definition: Conditional Probability
Let $(\Omega, \mathscr{F}, P)$ be a probability space and let $B\in \mathscr{F}$ be an event such that $P(B)>0$.
For any $A\in \mathscr{F}$, we define the conditional probability of $A$ given $B$, that we will denote $P(A|B)$ by $$P(A|B) = \frac{P(A\cap B)}{P(B)}$$
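As a quick numerical check of the definition, consider a hypothetical fair-die example:

```python
from fractions import Fraction

# Fair six-sided die: Omega = {1,...,6}, uniform probability measure
omega = set(range(1, 7))
P = lambda E: Fraction(len(E), len(omega))

A = {2, 4, 6}   # "even outcome"
B = {4, 5, 6}   # "outcome greater than 3"

# P(A|B) = P(A ∩ B) / P(B)
P_A_given_B = P(A & B) / P(B)
print(P_A_given_B)   # P({4,6}) / P({4,5,6}) = (2/6)/(3/6) = 2/3
```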
Proposition
Let $(\Omega, \mathscr{F}, P)$ be a probability space and let $B\in \mathscr{F}$ be an event such that $P(B)>0$. Then
- $P(\cdot|B)$ is a probability measure on the measurable space $(B, \mathscr{F}_B)$, where $\mathscr{F}_B=\{A\cap B | A\in \mathscr{F}\}$.
- If $A$ and $B$ are independent events, then $P(A|B) = P(A)$.
- [Bayes' rule] If $\{A_i\}_{i=1}^n \subset \mathscr{F}$ is a measurable partition of $\Omega$, such that $P(A_i) > 0, \ i=1,\ldots,n$, then $$P(A_j|B) = \frac{P(B|A_j)P(A_j)}{\sum_{i=1}^n P(B|A_i)P(A_i)}, \quad j=1,\ldots,n$$
Proof of Bayes' rule
- Let $\{A_i\}_{i=1}^n \subset \mathscr{F}$ be a measurable partition of $\Omega$, such that $P(A_i)>0$ for every $i=1,\ldots,n$.
- Then $\Omega = \cup_{i=1}^n A_i$ and the $A_i$'s are disjoint.
- Then $$P(B) = P(\Omega \cap B) = \sum_{i=1}^n P(A_i \cap B) = \sum_{i=1}^n P(B|A_i)P(A_i)$$
- Therefore, for each $j$, $$P(A_j|B) = \frac{P(A_j \cap B)}{P(B)} = \frac{P(B|A_j)P(A_j)}{\sum_{i=1}^n P(B|A_i)P(A_i)}$$

and Bayes' rule is proved.
For the case $\{A_1, A_2\}$, Bayes' rule ensures that $$P(A_1| B) = \frac{P(B|A_1)P(A_1)}{P(B|A_1)P(A_1)+P(B|A_2)P(A_2)}$$ and equivalently for $P(A_2|B)$.
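The two-event form can be checked with exact arithmetic; the priors and likelihoods below are hypothetical numbers chosen for illustration:

```python
from fractions import Fraction

# Hypothetical two-event setting: P(A1)=2/3, P(A2)=1/3,
# with likelihoods P(B|A1)=1/4 and P(B|A2)=1/2.
P_A1, P_A2 = Fraction(2, 3), Fraction(1, 3)
P_B_A1, P_B_A2 = Fraction(1, 4), Fraction(1, 2)

# Denominator: total probability P(B)
P_B = P_B_A1 * P_A1 + P_B_A2 * P_A2

# Two-event Bayes' rule
P_A1_B = P_B_A1 * P_A1 / P_B
P_A2_B = P_B_A2 * P_A2 / P_B

print(P_A1_B, P_A2_B)   # -> 1/2 1/2; the posteriors sum to 1
```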
References¶
- González-Barrios Murguía, J.M. (2011). Lecture Notes on Probability Theory, UNAM-IIMAS.
- Theodoridis, S. & Koutroumbas, K. (2009) Pattern Recognition, 4th ed., Academic Press.